NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

Compute-Optimal LLMs Provably Generalize Better With Scale

Finzi, M; Kapoor, S; Granziol, D; Gu, A; De_Sa, C; Kolter, JZ; Wilson, AG (April 2025, International Conference on Learning Representations)

Free, publicly-accessible full text available April 24, 2026
Searching for Efficient Linear Layers over a Continuous Space of Structured Matrices

Potapczynski, A; Qiu, S; Finzi, M; Ferri, C; Chen, Z; Goldblum, M; Bruss, B; De_Sa, C; Wilson, AG (December 2024, Advances in Neural Information Processing Systems)

Dense linear layers are the dominant computational bottleneck in large neural networks, presenting a critical need for more efficient alternatives. Previous efforts focused on a small number of hand-crafted structured matrices and neglected to investigate whether these structures can surpass dense layers in terms of compute-optimal scaling laws when both the model size and training examples are optimally allocated. In this work, we present a unifying framework that enables searching among all linear operators expressible via an Einstein summation. This framework encompasses many previously proposed structures, such as low-rank, Kronecker, Tensor-Train, Block Tensor-Train (BTT), and Monarch, along with many novel structures. To analyze the framework, we develop a taxonomy of all such operators based on their computational and algebraic properties and show that differences in the compute-optimal scaling laws are mostly governed by a small number of variables that we introduce. Namely, a small ω (which measures parameter sharing) and large ψ (which measures the rank) reliably led to better scaling laws. Guided by the insight that full-rank structures that maximize parameters per unit of compute perform the best, we propose BTT-MoE, a novel Mixture-of-Experts (MoE) architecture obtained by sparsifying computation in the BTT structure. In contrast to the standard sparse MoE for each entire feed-forward network, BTT-MoE learns an MoE in every single linear layer of the model, including the projection matrices in the attention blocks. We find BTT-MoE provides a substantial compute-efficiency gain over dense layers and standard MoE.
more » « less
Full Text Available
Compute Better Spent: Replacing Dense Layers with Structured Matrices

Qiu, S; Potapczynski, A; Finzi, M; Goldblum, M; Wilson, Andrew G (July 2024, International Conference on Machine Learning)

Full Text Available
CoLA: Exploiting Compositional Structure for Automatic and Efficient Numerical Linear Algebra

Potapczynski, Andres; Finzi, M; Pleiss, G; Wilson, Andrew G (December 2023, Advances in Neural Information Processing Systems)

Many areas of machine learning and science involve large linear algebra problems, such as eigendecompositions, solving linear systems, computing matrix exponentials, and trace estimation. The matrices involved often have Kronecker, convolutional, block diagonal, sum, or product structure. In this paper, we propose a simple but general framework for large-scale linear algebra problems in machine learning, named CoLA (Compositional Linear Algebra). By combining a linear operator abstraction with compositional dispatch rules, CoLA automatically constructs memory and runtime efficient numerical algorithms. Moreover, CoLA provides memory efficient automatic differentiation, low precision computation, and GPU acceleration in both JAX and PyTorch, while also accommodating new objects, operations, and rules in downstream packages via multiple dispatch. CoLA can accelerate many algebraic operations, while making it easy to prototype matrix structures and algorithms, providing an appealing drop-in tool for virtually any computational effort that requires linear algebra. We showcase its efficacy across a broad range of applications, including partial differential equations, Gaussian processes, equivariant model construction, and unsupervised learning.
more » « less
Full Text Available
SKIing on Simplices: Kernel Interpolation on the Permutohedral Lattice for Scalable Gaussian Processes

Kapoor, S; Finzi, M; Wang, A; Wilson, AG (January 2021, International Conference on Machine Learning (ICML))
null (Ed.)
Full Text Available
Generalizing Convolutional Neural Networks for Equivarianceto Lie Groups on Arbitrary Continuous Data

Finzi, M; Stanton, S; Izmailov, P; Wilson, A.G. (January 2020, International Conference on Machine Learning)

The translation equivariance of convolutional layers enables convolutional neural networks to generalize well on image problems. While translation equivariance provides a powerful inductive bias for images, we often additionally desire equivariance to other transformations, such as rotations, especially for non-image data. We propose a general method to construct a convolutional layer that is equivariant to transformations from any specified Lie group with a surjective exponential map. Incorporating equivariance to a new group requires implementing only the group exponential and logarithm maps, enabling rapid prototyping. Showcasing the simplicity and generality of our method, we apply the same model architecture to images, ball-and-stick molecular data, and Hamiltonian dynamical systems. For Hamiltonian systems, the equivariance of our models is especially impactful, leading to exact conservation of linear and angular momentum.
more » « less
Full Text Available
Generalizing Convolutional Networks for Equivariance to Lie Groups on Arbitrary Continuous Data.

Finzi, M; Stanton, S; Izmailov, P; Wilson, A.G. (January 2020, Proceedings of the International Conference on Machine Vision and Machine Learning)

The translation equivariance of convolutional layers enables convolutional neural networks to generalize well on image problems. While translation equivariance provides a powerful inductive bias for images, we often additionally desire equivariance to other transformations, such as rotations, especially for non-image data. We propose a general method to construct a convolutional layer that is equivariant to transformations from any specified Lie group with a surjective exponential map. Incorporating equivariance to a new group requires implementing only the group exponential and logarithm maps, enabling rapid prototyping. Showcasing the simplicity and generality of our method, we apply the same model architecture to images, ball-and-stick molecular data, and Hamiltonian dynamical systems. For Hamiltonian systems, the equivariance of our models is especially impactful, leading to exact conservation of linear and angular momentum.
more » « less
Full Text Available
Generalizing Convolutional Neural Networks for Equivariance to Lie Groups on Arbitrary Continuous Data

Finzi, M; Stanton, S; Izmailov, P; Wilson, AG (January 2020, International Conference on Machine Learning)

The translation equivariance of convolutional layers enables convolutional neural networks to generalize well on image problems. While translation equivariance provides a powerful inductive bias for images, we often additionally desire equivariance to other transformations, such as rotations, especially for non-image data. We propose a general method to construct a convolutional layer that is equivariant to transformations from any specified Lie group with a surjective exponential map. Incorporating equivariance to a new group requires implementing only the group exponential and logarithm maps, enabling rapid prototyping. Showcasing the simplicity and generality of our method, we apply the same model architecture to images, ball-and-stick molecular data, and Hamiltonian dynamical systems. For Hamiltonian systems, the equivariance of our models is especially impactful, leading to exact conservation of linear and angular momentum.
more » « less
Full Text Available

Search for: All records